# installation of packages in case the user has not installed yet
requirements <- c("nnet", "MASS", "e1071", "class", "leaps", "glmnet", "car", "caTools",
"pROC", "summarytools", "dplyr", "ggplot2", "tidyverse", "lubridate",
"mapview", "sf", "geojsonio", "leaflet", "broom", "gridExtra", "leafpop")
for (req in requirements){
if (!require(req, character.only = TRUE)){
install.packages(req)
}
}
## Caricamento del pacchetto richiesto: nnet
## Caricamento del pacchetto richiesto: MASS
## Caricamento del pacchetto richiesto: e1071
## Caricamento del pacchetto richiesto: class
## Caricamento del pacchetto richiesto: leaps
## Caricamento del pacchetto richiesto: glmnet
## Caricamento del pacchetto richiesto: Matrix
## Loaded glmnet 4.1-8
## Caricamento del pacchetto richiesto: car
## Caricamento del pacchetto richiesto: carData
## Caricamento del pacchetto richiesto: caTools
## Caricamento del pacchetto richiesto: pROC
## Type 'citation("pROC")' for a citation.
##
## Caricamento pacchetto: 'pROC'
## I seguenti oggetti sono mascherati da 'package:stats':
##
## cov, smooth, var
## Caricamento del pacchetto richiesto: summarytools
## Caricamento del pacchetto richiesto: dplyr
##
## Caricamento pacchetto: 'dplyr'
## Il seguente oggetto è mascherato da 'package:car':
##
## recode
## Il seguente oggetto è mascherato da 'package:MASS':
##
## select
## I seguenti oggetti sono mascherati da 'package:stats':
##
## filter, lag
## I seguenti oggetti sono mascherati da 'package:base':
##
## intersect, setdiff, setequal, union
## Caricamento del pacchetto richiesto: ggplot2
## Caricamento del pacchetto richiesto: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::expand() masks Matrix::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tidyr::pack() masks Matrix::pack()
## ✖ dplyr::recode() masks car::recode()
## ✖ dplyr::select() masks MASS::select()
## ✖ purrr::some() masks car::some()
## ✖ tidyr::unpack() masks Matrix::unpack()
## ✖ tibble::view() masks summarytools::view()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Caricamento del pacchetto richiesto: mapview
##
## Caricamento del pacchetto richiesto: sf
##
## Linking to GEOS 3.11.2, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE
##
## Caricamento del pacchetto richiesto: geojsonio
##
## Registered S3 method overwritten by 'geojsonsf':
## method from
## print.geojson geojson
##
##
## Caricamento pacchetto: 'geojsonio'
##
##
## Il seguente oggetto è mascherato da 'package:base':
##
## pretty
##
##
## Caricamento del pacchetto richiesto: leaflet
##
## Caricamento del pacchetto richiesto: broom
##
## Caricamento del pacchetto richiesto: gridExtra
##
##
## Caricamento pacchetto: 'gridExtra'
##
##
## Il seguente oggetto è mascherato da 'package:dplyr':
##
## combine
##
##
## Caricamento del pacchetto richiesto: leafpop
The Fire Incident Dispatch Data file contains data that is generated by the Starfire Computer Aided Dispatch System. The data spans from the time the incident is created in the system to the time the incident is closed in the system. It covers information about the incident as it relates to the assignment of resources and the Fire Department’s response to the emergency. To protect personal identifying information in accordance with the Health Insurance Portability and Accountability Act (HIPAA), specific locations of incidents are not included and have been aggregated to a higher level of detail.
In this analysis we have restricted the number of observations to the last 50.000, from 5th of September to 30th of the same month.
The dataset is formed by the following columns:
STARFIRE_INCIDENT_ID: An incident identifier comprising the 5 character julian date, 4 character alarm box number, 2 character number of incidents at the box so far for the day, 1 character borough code , 4 character sequence number.
INCIDENT_DATETIME: The date and time of the incident.
ALARM_BOX_BOROUGH: The borough of the alarm box.
ALARM_BOX_LOCATION: The location of the alarm box.
ALARM_BOX: The alarm box number.
INCIDENT_BOROUGH: The borough of the incident.
ZIPCODE: The zip code of the incident.
POLICEPRECINCT: The police precinct of the incident.
CITYCOUNCILDISTRICT: The city council district.
COMMUNITYDISTRICT: The community district.
COMMUNITYSCHOOLDISTRICT: The community school district.
CONGRESSIONALDISTRICT: The congressional district.
ALARM_SOURCE_DESCRIPTION_TX: The description of the alarm source.
ALARM_LEVEL_INDEX_DESCRIPTION: The alarm level index.
HIGHEST_ALARM_LEVEL: The highest alarm level.
INCIDENT_CLASSIFICATION: The incident classification.
INCIDENT_CLASSIFICATION_GROUP: The incident classification roll up group.
FIRST_ASSIGNMENT_DATETIME: The date and time of the first unit assignment.
FIRST_ACTIVATION_DATETIME: The date and time of the first unit acknowledgement of the assignment.
FIRST_ON_SCENE_DATETIME: The date and time of the first unit at the scene of the incident.
INCIDENT_CLOSE_DATETIME: The date and time that the incident was closed in the dispatch system.
VALID_DISPATCH_RSPNS_TIME_INDC: Indicates that the components comprising the generation of the DISPATCH_RESPONSE_SECONDS_QY are valid.
DISPATCH_RESPONSE_SECONDS_QY: The elapsed time in seconds between the INCIDENT_DATETIME and the FIRST_ASSIGNMENT_DATETIME.
VALID_INCIDENT_RSPNS_TIME_INDC: Indicates that the components comprising the generation of the INCIDENT_RESPONSE_SECONDS_QY are valid.
INCIDENT_RESPONSE_SECONDS_QY: The elapsed time in seconds between the INCIDENT_DATETIME and the FIRST_ON_SCENE_DATETIME.
INCIDENT_TRAVEL_TM_SECONDS_QY: The elapsed time in seconds between the FIRST_ASSIGNMENT_DATETIME and the FIRST_ON_SCENE_DATETIME.
ENGINES_ASSIGNED_QUANTITY: The number of engine units assigned to the incident.
LADDERS_ASSIGNED_QUANTITY: The number of ladder units assigned to the incident.
OTHER_UNITS_ASSIGNED_QUANTITY: The number of units that are not engines or ladders that were assigned to the incident.
We will try to create two different analyses.
In both analyses we tried to use a linear regression model, however we will see that the assumptions for applying the linear regression are not met, thus we will simplify our project moving into classification, dividing in two or more ranges the two responses. In addition to this we will perform data exploration and cleaning, studying the presence or not of patterns of NA values and invalid values.
The first step is always to read the dataset, plot the first 5 observations and see its whole dimension.
fire_data <- read.csv("datasets/Fire_Incident_Dispatch_Data_last_50k.csv")
head(fire_data)
## STARFIRE_INCIDENT_ID INCIDENT_DATETIME ALARM_BOX_BOROUGH
## 1 230905-B1937-001-0567 09/05/2023 02:19:04 PM BROOKLYN
## 2 230905-B3923-002-0568 09/05/2023 02:19:36 PM BROOKLYN
## 3 230905-X8897-003-0480 09/05/2023 02:19:43 PM BRONX
## 4 230905-X3466-001-0481 09/05/2023 02:21:00 PM BRONX
## 5 230905-B2448-001-0570 09/05/2023 02:21:26 PM BROOKLYN
## 6 230905-B2448-002-0571 09/05/2023 02:22:35 PM BROOKLYN
## ALARM_BOX_NUMBER ALARM_BOX_LOCATION INCIDENT_BOROUGH
## 1 1937 AUTUMN AVE & FULTON ST BROOKLYN
## 2 3923 N/S EASTERN PWAY & UTICA AVE BROOKLYN
## 3 8897 CROSS BX EXPY- DEEGAN EX TO JEROME AV BRONX
## 4 3466 ADEE AVE & BX PARK E BRONX
## 5 2448 GLENWOOD RD & BEDFORD AVE BROOKLYN
## 6 2448 GLENWOOD RD & BEDFORD AVE BROOKLYN
## ZIPCODE POLICEPRECINCT CITYCOUNCILDISTRICT COMMUNITYDISTRICT
## 1 11208 75 37 305
## 2 11213 71 35 309
## 3 NA NA NA NA
## 4 10467 49 12 211
## 5 11210 70 45 314
## 6 11210 70 45 314
## COMMUNITYSCHOOLDISTRICT CONGRESSIONALDISTRICT ALARM_SOURCE_DESCRIPTION_TX
## 1 19 7 EMS
## 2 17 9 CLASS-3
## 3 NA NA EMS-911
## 4 11 15 EMS
## 5 22 9 EMS
## 6 22 9 EMS
## ALARM_LEVEL_INDEX_DESCRIPTION HIGHEST_ALARM_LEVEL
## 1 Initial Alarm First Alarm
## 2 Initial Alarm First Alarm
## 3 Initial Alarm First Alarm
## 4 DEFAULT RECORD First Alarm
## 5 DEFAULT RECORD First Alarm
## 6 DEFAULT RECORD First Alarm
## INCIDENT_CLASSIFICATION INCIDENT_CLASSIFICATION_GROUP
## 1 Medical - No PT Contact EMS is Onscene Medical Emergencies
## 2 Hospital Fire Structural Fires
## 3 Vehicle Accident - Other NonMedical Emergencies
## 4 Medical - EMS Link 10-91 Medical Emergencies
## 5 Medical - EMS Link 10-91 Medical Emergencies
## 6 Medical - EMS Link 10-91 Medical Emergencies
## DISPATCH_RESPONSE_SECONDS_QY FIRST_ASSIGNMENT_DATETIME
## 1 7 09/05/2023 02:19:12 PM
## 2 95 09/05/2023 02:21:11 PM
## 3 41 09/05/2023 02:20:25 PM
## 4 298 09/05/2023 02:25:59 PM
## 5 25 09/05/2023 02:21:52 PM
## 6 350 09/05/2023 02:28:25 PM
## FIRST_ACTIVATION_DATETIME FIRST_ON_SCENE_DATETIME INCIDENT_CLOSE_DATETIME
## 1 09/05/2023 02:19:26 PM 09/05/2023 02:25:23 PM 09/05/2023 03:03:15 PM
## 2 09/05/2023 02:21:33 PM 09/05/2023 02:23:21 PM 09/05/2023 02:34:18 PM
## 3 09/05/2023 02:20:35 PM 09/05/2023 02:26:22 PM 09/05/2023 04:13:32 PM
## 4 09/05/2023 02:26:04 PM 09/05/2023 02:34:23 PM
## 5 09/05/2023 02:22:08 PM 09/05/2023 02:28:07 PM
## 6 09/05/2023 02:29:09 PM
## VALID_DISPATCH_RSPNS_TIME_INDC VALID_INCIDENT_RSPNS_TIME_INDC
## 1 N Y
## 2 N Y
## 3 N Y
## 4 N N
## 5 N N
## 6 N N
## INCIDENT_RESPONSE_SECONDS_QY INCIDENT_TRAVEL_TM_SECONDS_QY
## 1 378 371
## 2 224 129
## 3 398 357
## 4 NA NA
## 5 NA NA
## 6 NA NA
## ENGINES_ASSIGNED_QUANTITY LADDERS_ASSIGNED_QUANTITY
## 1 1 0
## 2 3 2
## 3 2 3
## 4 1 0
## 5 1 0
## 6 1 0
## OTHER_UNITS_ASSIGNED_QUANTITY
## 1 0
## 2 1
## 3 1
## 4 0
## 5 0
## 6 0
dim(fire_data)
## [1] 50000 29
We use dfSummary from summarytool in order to have a complete and clear summary of the dataset.
print(dfSummary(fire_data,
plain.ascii = FALSE,
style = "multiline",
headings = FALSE,
graph.magnif = 0.8,
valid.col = FALSE),
method = 'render')
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | STARFIRE_INCIDENT_ID [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2 | INCIDENT_DATETIME [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 3 | ALARM_BOX_BOROUGH [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 4 | ALARM_BOX_NUMBER [integer] |
|
7411 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 5 | ALARM_BOX_LOCATION [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 6 | INCIDENT_BOROUGH [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 7 | ZIPCODE [integer] |
|
217 distinct values | 3181 (6.4%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 8 | POLICEPRECINCT [integer] |
|
77 distinct values | 3180 (6.4%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 9 | CITYCOUNCILDISTRICT [integer] |
|
51 distinct values | 3180 (6.4%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10 | COMMUNITYDISTRICT [integer] |
|
70 distinct values | 3180 (6.4%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 11 | COMMUNITYSCHOOLDISTRICT [integer] |
|
32 distinct values | 3182 (6.4%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 12 | CONGRESSIONALDISTRICT [integer] |
|
13 distinct values | 3180 (6.4%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 13 | ALARM_SOURCE_DESCRIPTION_TX [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 14 | ALARM_LEVEL_INDEX_DESCRIPTION [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 15 | HIGHEST_ALARM_LEVEL [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 16 | INCIDENT_CLASSIFICATION [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 17 | INCIDENT_CLASSIFICATION_GROUP [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 18 | DISPATCH_RESPONSE_SECONDS_QY [integer] |
|
841 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 19 | FIRST_ASSIGNMENT_DATETIME [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 20 | FIRST_ACTIVATION_DATETIME [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 21 | FIRST_ON_SCENE_DATETIME [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 22 | INCIDENT_CLOSE_DATETIME [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 23 | VALID_DISPATCH_RSPNS_TIME_INDC [character] | 1. N |
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 24 | VALID_INCIDENT_RSPNS_TIME_INDC [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 25 | INCIDENT_RESPONSE_SECONDS_QY [integer] |
|
1496 distinct values | 14112 (28.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 26 | INCIDENT_TRAVEL_TM_SECONDS_QY [integer] |
|
1382 distinct values | 14112 (28.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 27 | ENGINES_ASSIGNED_QUANTITY [integer] |
|
15 distinct values | 62 (0.1%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 28 | LADDERS_ASSIGNED_QUANTITY [integer] |
|
12 distinct values | 62 (0.1%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 29 | OTHER_UNITS_ASSIGNED_QUANTITY [integer] |
|
23 distinct values | 62 (0.1%) |
Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-18
Now we decide to rename all the columns so that they have a shorter length so that they are easily visible in future plots.
fire_data <- fire_data %>%
rename(id = STARFIRE_INCIDENT_ID, datetime = INCIDENT_DATETIME, al_borough = ALARM_BOX_BOROUGH,
al_number = ALARM_BOX_NUMBER,al_location = ALARM_BOX_LOCATION, inc_borough = INCIDENT_BOROUGH,
zipcode = ZIPCODE, pol_prec = POLICEPRECINCT, city_con_dist = CITYCOUNCILDISTRICT,
commu_dist = COMMUNITYDISTRICT, commu_sc_dist = COMMUNITYSCHOOLDISTRICT,
cong_dist = CONGRESSIONALDISTRICT, al_source_desc = ALARM_SOURCE_DESCRIPTION_TX,
al_index_desc = ALARM_LEVEL_INDEX_DESCRIPTION, highest_al_level = HIGHEST_ALARM_LEVEL,
inc_class = INCIDENT_CLASSIFICATION, inc_class_group = INCIDENT_CLASSIFICATION_GROUP,
first_ass_datetime = FIRST_ASSIGNMENT_DATETIME, first_act_datetime = FIRST_ACTIVATION_DATETIME,
first_onscene_datetime = FIRST_ON_SCENE_DATETIME, inc_close_datetime = INCIDENT_CLOSE_DATETIME,
disp_resp_sec_qy = DISPATCH_RESPONSE_SECONDS_QY, disp_resp_sec_indc = VALID_DISPATCH_RSPNS_TIME_INDC,
inc_resp_sec_qy = INCIDENT_RESPONSE_SECONDS_QY, inc_resp_sec_indc = VALID_INCIDENT_RSPNS_TIME_INDC,
inc_travel_sec_qy = INCIDENT_TRAVEL_TM_SECONDS_QY,
engines_assigned = ENGINES_ASSIGNED_QUANTITY,
ladders_assigned = LADDERS_ASSIGNED_QUANTITY, others_units_assigned = OTHER_UNITS_ASSIGNED_QUANTITY)
As we can see from the summary there are lots of work to do, in fact initially we identify the following problem:
NA values.We start by converting the non factorial columns into factorial one. Then we perform accurate data exploration and cleaning in each macro column set.
# set factorial
fire_data$inc_borough <- as.factor(fire_data$inc_borough)
fire_data$al_borough <- as.factor(fire_data$al_borough)
fire_data$al_source_desc <- as.factor(fire_data$al_source_desc)
fire_data$al_index_desc <- as.factor(fire_data$al_index_desc)
fire_data$highest_al_level <- as.factor(fire_data$highest_al_level)
fire_data$cong_dist <- as.factor(fire_data$cong_dist)
fire_data$disp_resp_sec_indc <- as.factor(fire_data$disp_resp_sec_indc)
levels(fire_data$disp_resp_sec_indc)<- c("N", "Y")
fire_data$inc_resp_sec_indc <- as.factor(fire_data$inc_resp_sec_indc)
levels(fire_data$inc_resp_sec_indc)<- c("N", "Y")
fire_data$inc_class_group <- as.factor(fire_data$inc_class_group)
fire_data$inc_class <- as.factor(fire_data$inc_class)
For the temporal data processing we employ a little of feature engineering. Indeed we decide to add the following predictors:
day_type: a factorial predictor to indicate if the incident day is a week day or nottime_of_day: a factorial predictor that indicates the range of time whenever the incident happens, so Night if the hour is between 0 and 6, Morning if the hour is between 6 and 12, Afternoon if the hour is between 12 and 18 and Evening if the hour is between 18 and 24.emergency_min_qy: which represents the difference between the inc_close_datetime and the first_onscene_datetime.working_hour: indicates if the incident happened between 8AM and 7PM, so the classic working hour.Moreover since we are dealing with datetime we also check if the time differences inc_resp_sec_qy, inc_travel_sec_qy and disp_resp_sec_qy are actually corrects, if not we replace them with the correct one, by performing the subtraction of the relative datetime (see dataset description).
Now we note that the maximum level of the time differences is very high to be considered as seconds so we decided to scale all the time differences in minutes.
summary(fire_data %>% select(inc_resp_sec_qy, inc_travel_sec_qy, disp_resp_sec_qy))
## inc_resp_sec_qy inc_travel_sec_qy disp_resp_sec_qy
## Min. : 18.0 Min. : 0.0 Min. : 2.00
## 1st Qu.: 265.0 1st Qu.: 233.0 1st Qu.: 7.00
## Median : 334.0 Median : 301.0 Median : 19.00
## Mean : 380.7 Mean : 340.5 Mean : 39.96
## 3rd Qu.: 426.0 3rd Qu.: 392.0 3rd Qu.: 40.00
## Max. :7130.0 Max. :7122.0 Max. :9023.00
## NA's :14112 NA's :14112
# scaling
fire_data$inc_resp_sec_qy <- fire_data$inc_resp_sec_qy / 60
fire_data$inc_travel_sec_qy <- fire_data$inc_travel_sec_qy / 60
fire_data$disp_resp_sec_qy <- fire_data$disp_resp_sec_qy / 60
# renaming both quantity and indicator predictors for the two datetime
fire_data <- fire_data %>% rename(inc_resp_min_qy = inc_resp_sec_qy,
inc_travel_min_qy = inc_travel_sec_qy, disp_resp_min_qy = disp_resp_sec_qy, # quantity
inc_resp_min_indc = inc_resp_sec_indc, disp_resp_min_indc = disp_resp_sec_indc) # indicator
Perform the datetime feature engineering that we have discussed before.
# Process datetime column
fire_data$datetime <- mdy_hms(fire_data$datetime)
fire_data$first_ass_datetime <- mdy_hms(fire_data$first_ass_datetime)
fire_data$first_act_datetime <- mdy_hms(fire_data$first_act_datetime)
fire_data$first_onscene_datetime <- mdy_hms(fire_data$first_onscene_datetime)
fire_data$inc_close_datetime <- mdy_hms(fire_data$inc_close_datetime)
# checking if the differences are well computed if not change with the correct one
if (!identical(fire_data$inc_resp_min_qy,
as.numeric(difftime(
fire_data$first_onscene_datetime, fire_data$datetime, units="mins")))){
fire_data$inc_resp_min_qy <- as.numeric(
difftime(fire_data$first_onscene_datetime, fire_data$datetime, units="mins"))
}
if (!identical(fire_data$inc_travel_min_qy, as.numeric(
difftime(fire_data$first_onscene_datetime, fire_data$first_ass_datetime, units="mins")))){
fire_data$inc_travel_min_qy <- as.numeric(
difftime(fire_data$first_onscene_datetime, fire_data$first_ass_datetime, units="mins"))
}
if (!identical(fire_data$disp_resp_min_qy, as.numeric(difftime(
fire_data$first_ass_datetime, fire_data$datetime, units="mins")))){
fire_data$disp_resp_min_qy <- as.numeric(difftime(
fire_data$first_ass_datetime, fire_data$datetime, units="mins"))
}
# creating emergency_min_qy which describe the time taken by the firefighter to close the emergency after have been arrived to the location
fire_data$emergency_min_qy <- as.numeric(difftime(
fire_data$inc_close_datetime, fire_data$first_onscene_datetime, units="mins"))
# creating day_type
fire_data$day_type <- as.factor(
ifelse(weekdays(fire_data$datetime) %in% c("sabato", "domenica"), "Weekend", "Weekday"))
# creating working_hour
fire_data$working_hour <- as.factor(ifelse(
hour(fire_data$datetime) >= 19 | hour(fire_data$datetime) < 8, FALSE, TRUE))
# creating time_of_day
fire_data$time_of_day <- cut(
hour(fire_data$datetime),
breaks = c(0, 6, 12, 18, 24),
labels = c("Night", "Morning", "Afternoon", "Evening"),
include.lowest = TRUE,
right = TRUE
)
# delete the datetime
fire_data$datetime <- NULL
Check the distribution of time_of_day.
table(fire_data$time_of_day)
##
## Night Morning Afternoon Evening
## 8521 13270 16499 11710
ggplot(data=fire_data %>%
group_by(time_of_day) %>%
summarise(incident_number = n()),
aes(x=time_of_day, y=incident_number)) +
geom_bar(stat="identity", position=position_dodge()) +
geom_text(aes(label=incident_number), vjust=1.6, color="white", position = position_dodge(0.9), size=3.5) +
labs(title = "Time of the Day - Incident Count", x = "Time of the Day", y = "Incident Count")
From this we can see that the higher number of fire incidents is registered from 12 PM to 18 PM, whereas the lower number of fire incidents happened from 00 AM to 06 AM.
Check the distribution of day_type.
table(fire_data$day_type)
##
## Weekday Weekend
## 36482 13518
day_type_table <- table(fire_data$day_type)
day_type_table[1] <- day_type_table[1] / 5
day_type_table[2] <- day_type_table[2] / 2
day_type_table
##
## Weekday Weekend
## 7296.4 6759.0
And in proportion we can see that on average there is a higher number of fire incidents on the weekdays compared to the weekend days.
Check the distribution of working_hour.
table(fire_data$working_hour)
##
## FALSE TRUE
## 21859 28141
table(fire_data$working_hour) / dim(fire_data)[1]
##
## FALSE TRUE
## 0.43718 0.56282
It seems like there are slightly more incidents during the working hour.
Rename the factor levels for the inc_borough and al_borough.
fire_data <- fire_data %>% mutate(inc_borough = recode_factor(
inc_borough, "BRONX" = "Bronx", "BROOKLYN" = "Brooklyn", "MANHATTAN" = "Manhattan",
"QUEENS" = "Queens", "RICHMOND / STATEN ISLAND" = "Staten Island"),
al_borough = recode_factor(
al_borough, "BRONX" = "Bronx", "BROOKLYN" = "Brooklyn", "MANHATTAN" = "Manhattan",
"QUEENS" = "Queens", "RICHMOND / STATEN ISLAND" = "Staten Island"))
Regarding the Spatial Data we decide to keep the borough and also the congressional district since these are the two predictors that have the least number of categories. Further in the section of data visualization we have a look on an interactive map of some relevant information for both aspects.
fire_data <- fire_data %>% mutate(inc_class_group = recode_factor(
inc_class_group, "Medical Emergencies" = "M_E",
"NonMedical Emergencies" = "NonM_E", "Medical MFAs" = "M_MFAs",
"NonMedical MFAs" = "NonM_MFAs", "Structural Fires" = "S_F",
"NonStructural Fires" = "NonS_F"))
At this point we merge some possible value from factorial predictors to make their value space of smaller size.
Here we merge the following factorial values of highest_al_level: All Hands Working, Second Alarm and Third Alarm into NonFirst Alarm. But before doing so let’s see the actual distribution of values.
table(fire_data$highest_al_level)
##
## All Hands Working First Alarm Second Alarm Third Alarm
## 100 49891 8 1
As we can see the majority of the observations are of the type First Alarm, whereas the other observation reach 109 observation, thus we concatenate the less category values into a single one called NonFirst Alarm
fire_data$highest_alarm_lev_new <- fire_data$highest_al_level
levels(fire_data$highest_alarm_lev_new) <- list(
"First Alarm" = "First Alarm",
"NonFirst Alarm" = c("All Hands Working", "Second Alarm", "Third Alarm")
)
print(ctable(fire_data$highest_al_level,
fire_data$highest_alarm_lev_new, prop = 'n',
totals = FALSE, headings = FALSE),
method = 'render')
| highest_alarm_lev_new | ||
|---|---|---|
| highest_al_level | First Alarm | NonFirst Alarm |
| All Hands Working | 0 | 100 |
| First Alarm | 49891 | 0 |
| Second Alarm | 0 | 8 |
| Third Alarm | 0 | 1 |
Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-18
fire_data$highest_al_level <- fire_data$highest_alarm_lev_new
fire_data$highest_alarm_lev_new <- NULL
Here we merge the following factorial values of al_index_desc: Second Alarm, Third Alarm, 7-5 (All Hands Alarm), 10-76 & 10-77 Signal (Notification Hi-Rise Fire) and 10-75 Signal (Request for all hands alarm) into Others. But before doing so let’s see the actual distribution of values.
table(fire_data$al_index_desc)
##
## 10-75 Signal (Request for all hands alarm)
## 13
## 10-76 & 10-77 Signal (Notification Hi-Rise Fire)
## 3
## 7-5 (All Hands Alarm)
## 100
## DEFAULT RECORD
## 17313
## Initial Alarm
## 32562
## Second Alarm
## 8
## Third Alarm
## 1
As we can see the two major category are DEFAULT RECORD and Initial Alarm, whereas the rest of categories occur very few time respect the main two, thus we decide again to merge them.
fire_data$alarm_level_idx_new <- fire_data$al_index_desc
levels(fire_data$alarm_level_idx_new) <- list(
"DEFAULT RECORD" = "DEFAULT RECORD",
"Initial Alarm" = "Initial Alarm",
"Others" = c("Second Alarm", "Third Alarm", "7-5 (All Hands Alarm)",
"10-76 & 10-77 Signal (Notification Hi-Rise Fire)",
"10-75 Signal (Request for all hands alarm)")
)
print(ctable(fire_data$al_index_desc,
fire_data$alarm_level_idx_new, prop = 'n', totals = FALSE, headings = FALSE),
method = 'render')
| alarm_level_idx_new | |||
|---|---|---|---|
| al_index_desc | DEFAULT RECORD |
Initial Alarm |
Others |
| 10-75 Signal (Request for all hands alarm) | 0 | 0 | 13 |
| 10-76 & 10-77 Signal (Notification Hi-Rise Fire) | 0 | 0 | 3 |
| 7-5 (All Hands Alarm) | 0 | 0 | 100 |
| DEFAULT RECORD | 17313 | 0 | 0 |
| Initial Alarm | 0 | 32562 | 0 |
| Second Alarm | 0 | 0 | 8 |
| Third Alarm | 0 | 0 | 1 |
Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-18
fire_data$al_index_desc <- fire_data$alarm_level_idx_new
fire_data$alarm_level_idx_new <- NULL
Here we merge the following factorial values of al_source_desc: 911, 911TEXT, VERBAL, BARS, ERS, ERS-NC and SOL into Others. But before doing so let’s see the actual distribution of values.
table(fire_data$al_source_desc)
##
## 911 911TEXT BARS CLASS-3 EMS EMS-911 ERS ERS-NC PHONE SOL
## 302 14 1 5025 17178 10520 777 1 15146 5
## VERBAL
## 1031
We decide to maintain as original the highest four and merge the rest into a new category.
fire_data$alarm_source_desc_new <- fire_data$al_source_desc
levels(fire_data$alarm_source_desc_new) <- list(
"PHONE" = "PHONE",
"EMS" = "EMS",
"EMS-911" = "EMS-911",
"CLASS-3" = "CLASS-3",
"Others" = c("911", "911TEXT", "VERBAL", "BARS", "ERS", "ERS-NC", "SOL")
)
print(ctable(fire_data$al_source_desc, fire_data$alarm_source_desc_new,
prop = 'n', totals = FALSE, headings = FALSE),
method = 'render')
| alarm_source_desc_new | |||||
|---|---|---|---|---|---|
| al_source_desc | PHONE | EMS | EMS-911 | CLASS-3 | Others |
| 911 | 0 | 0 | 0 | 0 | 302 |
| 911TEXT | 0 | 0 | 0 | 0 | 14 |
| BARS | 0 | 0 | 0 | 0 | 1 |
| CLASS-3 | 0 | 0 | 0 | 5025 | 0 |
| EMS | 0 | 17178 | 0 | 0 | 0 |
| EMS-911 | 0 | 0 | 10520 | 0 | 0 |
| ERS | 0 | 0 | 0 | 0 | 777 |
| ERS-NC | 0 | 0 | 0 | 0 | 1 |
| PHONE | 15146 | 0 | 0 | 0 | 0 |
| SOL | 0 | 0 | 0 | 0 | 5 |
| VERBAL | 0 | 0 | 0 | 0 | 1031 |
Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-18
fire_data$al_source_desc <- fire_data$alarm_source_desc_new
fire_data$alarm_source_desc_new <- NULL
View again the dataset summary to see the applied changes.
print(dfSummary(fire_data,
plain.ascii = FALSE,
style = "multiline",
headings = FALSE,
graph.magnif = 0.8,
valid.col = FALSE),
method = 'render')
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | id [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2 | al_borough [factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 3 | al_number [integer] |
|
7411 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 4 | al_location [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 5 | inc_borough [factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 6 | zipcode [integer] |
|
217 distinct values | 3181 (6.4%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 7 | pol_prec [integer] |
|
77 distinct values | 3180 (6.4%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 8 | city_con_dist [integer] |
|
51 distinct values | 3180 (6.4%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 9 | commu_dist [integer] |
|
70 distinct values | 3180 (6.4%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10 | commu_sc_dist [integer] |
|
32 distinct values | 3182 (6.4%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 11 | cong_dist [factor] |
|
|
3180 (6.4%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 12 | al_source_desc [factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 13 | al_index_desc [factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 14 | highest_al_level [factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 15 | inc_class [factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 16 | inc_class_group [factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 17 | disp_resp_min_qy [numeric] |
|
850 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 18 | first_ass_datetime [POSIXct, POSIXt] |
|
49509 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 19 | first_act_datetime [POSIXct, POSIXt] |
|
49205 distinct values | 139 (0.3%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 20 | first_onscene_datetime [POSIXct, POSIXt] |
|
35552 distinct values | 14112 (28.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 21 | inc_close_datetime [POSIXct, POSIXt] |
|
49409 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 22 | disp_resp_min_indc [factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 23 | inc_resp_min_indc [factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 24 | inc_resp_min_qy [numeric] |
|
1497 distinct values | 14112 (28.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 25 | inc_travel_min_qy [numeric] |
|
1369 distinct values | 14112 (28.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 26 | engines_assigned [integer] |
|
15 distinct values | 62 (0.1%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 27 | ladders_assigned [integer] |
|
12 distinct values | 62 (0.1%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 28 | others_units_assigned [integer] |
|
23 distinct values | 62 (0.1%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 29 | emergency_min_qy [numeric] |
|
4429 distinct values | 14112 (28.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 30 | day_type [factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 31 | working_hour [factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 32 | time_of_day [factor] |
|
|
0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-18
The next step is to deal invalid values and delete some un-useful predictors.
First of all we saw the possibility that al_borough and inc_borough represent the same column, let’s check it.
identical(fire_data$al_borough, fire_data$inc_borough)
## [1] TRUE
The column al_borough and inc_borough have the same sequence of values, so we can delete one of the two.
fire_data <- fire_data %>% select(-c(al_borough))
Then we say that all observation in the dataset have the indicator predictor disp_resp_min_indc equal to N, let’s check again and in affirmative case then we can delete both columns.
summary(fire_data$disp_resp_min_indc)
## N Y
## 50000 0
All our observations have non valid disp_resp_min_indc so we could delete both the column indicator and the respective column quantity disp_resp_min_qy. However we note that also in the original full dataset all the observation have the disp_resp_min_indc set to N, which is quite strange, and it seems like there is problem relative to the data acquisition, thus for the moment we decide to keep this time difference.
Now we do a quick check also on the other indicator predictor inc_resp_min_indc
summary(fire_data$inc_resp_min_indc)
## N Y
## 17036 32964
But here we have some observations with valid inc_resp_min_indc, and we will consider only the valid one deleting the one that has a non valid attribute.
However before doing that let’s see the distribution of inc_resp_min_qy around the borough.
ggplot(data=fire_data %>%
group_by(inc_borough, inc_resp_min_indc) %>% summarise(incident_number = n()),
aes(x=inc_borough, y=incident_number, fill=inc_resp_min_indc)) +
geom_bar(stat="identity", position=position_dodge()) +
geom_text(aes(label=incident_number), vjust=1.6, color="white",
position = position_dodge(0.9), size=3.5) +
scale_fill_brewer(palette="Paired") +
labs(title = "Incident Count - Borouh - Valid Response Time in Minutes",
x = "Borough", y = "Incident Number",
fill = "Valid Response\n Time in Minutes")
## `summarise()` has grouped output by 'inc_borough'. You can override using the
## `.groups` argument.
We can see that the number of fire incidents is higher for the valid response time in minutes with the top level for the Brooklyn borough. But it is very interesting to observe the ratio between the valid and the non valid.
Here we see the ratio of valid inc_resp_min_indc in each borough:
rateo_inc_resp_min_indc <- fire_data %>%
group_by(inc_borough, inc_resp_min_indc) %>%
summarise(incident_number = n()) %>%
mutate(ratio=incident_number/sum(incident_number))
## `summarise()` has grouped output by 'inc_borough'. You can override using the
## `.groups` argument.
ggplot(rateo_inc_resp_min_indc, aes(fill=inc_resp_min_indc,
y=ratio, x=inc_borough)) +
geom_bar(position="fill", stat="identity") +
geom_text(aes(label=scales::percent(ratio)),
position=position_fill(vjust=0.5)) +
labs(title="Borough - Rateo Incident between Valid and Invalid",
x="Borough",
y="Rateo Incident between Valid and Invalid",
fill="Valid Response\nTime in Minutes")
And we can see that Staten Island has the higher number of incidents with valid inc_resp_min_indc , whereas Manhattan has the lower number, but remember that the former has the lowest number of fire incidents and the latter has the higher number of incidents.
Now we do an additional analysis to see if there is some relation between the inc_resp_min_indc and a new predictor the total_assigned_unit which is the sum of all the assigned units.
fire_data$total_assigned_unit <- fire_data$engines_assigned +
fire_data$ladders_assigned + fire_data$others_units_assigned
ggplot(fire_data %>% drop_na(), aes(total_assigned_unit, inc_resp_min_qy)) +
geom_point(aes(colour = inc_resp_min_indc), na.rm = TRUE)+
labs(title = "Total Assigned Units - Response Time In Minutes",
x = "Total Assigned Units",
y = "Response Time In Minutes", colour = "Valid Response\n Time in Minutes")
We note that the majority of fire incidents that have been assigned to a single unit have a high invalid response time. Whereas for a higher number of total units the response time decreases and becomes valid.
So we decide to have a look at the distribution of total assigned units with respect to the response time for the observations that have invalid response time, all grouped by incident class.
ggplot(fire_data %>% filter(inc_resp_min_indc == "N") %>% drop_na()
, aes(total_assigned_unit, inc_resp_min_qy)) +
geom_point(aes(colour = inc_class_group)) +
labs(title = "Total Assigned Units - Response Time In Minutes - Incidnet Class Group",
x = "Total Assigned Units", y = "Response Time In Minutes",
colour = "Incident Class Groups")
And we see a majority of observations being of the incident class Medical Emergencies having assigned a single unit. A closer look at the distribution of invalid response time for each incident class group summarised by the count of incidents confirms what we already see in the previous points chart.
irt_i_class <- arrange(fire_data %>% filter(inc_resp_min_indc == "N") %>%
group_by(inc_class_group) %>% summarise(incident_number = n()),
desc(incident_number))
irt_i_class
## # A tibble: 6 × 2
## inc_class_group incident_number
## <fct> <int>
## 1 M_E 15116
## 2 NonM_E 1589
## 3 NonM_MFAs 162
## 4 S_F 66
## 5 NonS_F 59
## 6 M_MFAs 44
Now we add an indicator column that tells us if the total assigned units is one or not. This would be useful in order to see possible relations.
# add an additional predictor
fire_data$tua_is_one <- as.factor(ifelse(fire_data$total_assigned_unit == 1, TRUE, FALSE))
See the relative incidents count for each incident class.
irt_i_class_su <- arrange(fire_data %>%
filter(inc_resp_min_indc == "N",tua_is_one == TRUE) %>%
group_by(inc_class_group) %>%
summarise(incident_number = n()), desc(incident_number))
irt_i_class_su
## # A tibble: 5 × 2
## inc_class_group incident_number
## <fct> <int>
## 1 M_E 14950
## 2 NonM_E 836
## 3 NonM_MFAs 128
## 4 M_MFAs 28
## 5 NonS_F 19
And we found that majority of all observation if not almost the entire dataset records are from the Medical Emergencies with only 166 incidents that have been assigned more than a single unit. Whereas almost all the other incidents are from the NonMedical Emergencies where in this case the incidents that have been assigned more than a single unit are more, so we have to take into account this higher amount of observations. Indeed we have 753 NonMedical Emergencies incidents that have invalid response time and have been assigned more than a single unit.
At this point we decide to view the presence or not of a pattern of incident with invalid response time that belongs to the Medical Emergencies class with a single unit assigned, first by grouping by borough.
tuaisone <- fire_data %>%
filter(inc_resp_min_indc == "N", inc_class_group == "M_E") %>%
group_by(inc_borough, tua_is_one) %>%
summarise(incident_number = n())
## `summarise()` has grouped output by 'inc_borough'. You can override using the
## `.groups` argument.
Plot the bar chart.
ggplot(data=tuaisone,
aes(x=inc_borough, y=incident_number, fill=tua_is_one)) +
geom_bar(stat="identity", position=position_dodge()) +
geom_text(aes(label=incident_number), vjust=1.5, color="black",
position = position_dodge(0.9), size=3.5) +
scale_fill_brewer(palette="Set1") +
labs(title = "Total Assigned Units One or Not for Invaid Resp Min Time",
x = "Borough", y = "Incident Count", fill = "Total Assigned\nUnits are One")
And we can see that Bronx, Brooklyn and Manhattan have more or less the same amount of incidents with invalid response time in minutes with a single assigned unit.
Now we continue this analysis of the invalid response time by changing the focus on the type of incident class (the sub-category which is more precise) with a single total assigned units.
ggplot(data=fire_data %>%
filter(inc_resp_min_indc == "N", inc_class_group == "M_E", tua_is_one == TRUE) %>%
group_by(inc_class, inc_borough) %>%
summarise(incident_number = n()),
aes(x=inc_borough, y=incident_number, fill=inc_class)) +
geom_bar(stat="identity", position=position_dodge()) +
geom_text(aes(label=incident_number), vjust=-0.5, color="black",
position = position_dodge(0.9), size=3) +
scale_fill_brewer(palette="Set1") +
labs(title = "Borough - Incident Counts - Incident Class -- for Total Assigned Units equal to 1 for Invalid Responde Time",
x = "Borough", y = "Incident Counts", fill = "Incident Class Group")
## `summarise()` has grouped output by 'inc_class'. You can override using the
## `.groups` argument.
And we found that the majority of the incidents that respect these circumstances are mostly identified as Medical - EMS Link 10-91 and Medical - PD Link 10-91.
Thanks to the 10code site we found a description of the two emergency codes:
10-91 Medical Emergency EMS - Fire Unit Not Required - To be transmitted through borough dispatcher by the responding unit when the fire Unit is canceled enroute due to EMS on scene, or EMS downgrades the job to a segment that does not require a Fire Unit response. Note: This signal shall be used only for medical emergency incidents. EMS we are sure stands for Emergency Medical Services.
10-91 Medical Emergency PD - Fire Unit Not Required - To be transmitted through borough dispatcher by the responding unit when the fire Unit is canceled enroute due to PD on scene, or PD downgrades the job to a segment that does not require a Fire Unit response. Note: This signal shall be used only for medical emergency incidents. PD we think that stands for Police Department.
Thus we can trust this information since they make sense and consider only the observations that have inc_resp_min_indc == "Y".
Now we can look for the NonMedical Emergencies since they are the second category for the number of observations by first seeing the distribution of its incident class.
Here we print the top 5 classes which respect the following conditions: inc_resp_min_indc == "N", inc_class_group == "NonMedical Emergencies"
print(head(arrange(fire_data %>%
filter(inc_resp_min_indc == "N",
inc_class_group == "NonM_E") %>%
group_by(inc_class) %>%
summarise(incident_number = n()), desc(incident_number))))
## # A tibble: 6 × 2
## inc_class incident_number
## <fct> <int>
## 1 Assist Civilian - Non-Medical 828
## 2 Alarm System - Unnecessary 110
## 3 Elevator Emergency - Occupied 104
## 4 Vehicle Accident - Other 100
## 5 Maritime Emergency 62
## 6 Utility Emergency - Water 60
And we found that the majority of non valid inc_resp_min_indc that are Non-Medical Emergency are from the incident class group equal to Assist Civilian - Non-Medical.
ggplot(data=fire_data %>%
filter(inc_resp_min_indc == "N",
inc_class == "Assist Civilian - Non-Medical") %>%
group_by(inc_borough) %>%
summarise(incident_number = n()),
aes(x=inc_borough, y=incident_number)) +
geom_bar(stat="identity", position=position_dodge()) +
geom_text(aes(label=incident_number), vjust=1.6, color="white",
position = position_dodge(0.9), size=3.5) +
labs(title = "Assist Civilian - Non-Medical / Invalid Response / More than 1 Units",
x = "Borough", y = "Incident Count")
We can see that except Staten Island the amount of incidents is pretty constant.
So for a stake of simplicity we decided to trust the indicator of incident response time and thus considering only the observations that have inc_resp_min_indc == "Y".
fire_data <- fire_data %>% filter(inc_resp_min_indc == "Y")
dim(fire_data)
## [1] 32964 33
Now we want to know how many inc_class are summarized in each inc_class_group and if there are any intersections between any incident classes. We perform this analysis to be sure that each inc_class_group is referred to as a single inc_class.
unique_category <- fire_data%>%
group_by(inc_class) %>%
summarise(unique_maincategory = n_distinct((inc_class_group)))
conflict <- unique_category %>% filter(unique_maincategory > 1)
if (dim(conflict)[1] == 0){
print("There are NO conflict between main and sub-category")
} else {
print("There are conflict between main and sub-category")
}
## [1] "There are NO conflict between main and sub-category"
So we are sure that each sub category has a unique main-category incident.
At this point to be more clear we display each main class with each respective sub-class.
for (variable in levels(fire_data$inc_class_group)) {
non_zero_table <- table(subset(fire_data, inc_class_group == variable)$inc_class)
cat(variable, "\n")
print(non_zero_table[non_zero_table != 0])
cat("\n")
}
## M_E
##
## Medical - Assist Civilian Medical - Breathing / Ill or Sick
## 27 4779
## Medical - EMS Link 10-91 Medical - No PT Contact EMS is Onscene
## 1096 4285
## Medical - PD Link 10-91 Medical - Serious Life Threatening
## 868 366
## Medical - Victim Deceased
## 287
##
## NonM_E
##
## Alarm System - Defective
## 377
## Alarm System - Testing
## 706
## Alarm System - Unnecessary
## 2735
## Assist Civilian - Non-Medical
## 3312
## Carbon Monoxide - Code 1 - Investigation
## 788
## Carbon Monoxide - Code 2 - Incident (1-9 ppm)
## 129
## Carbon Monoxide - Code 3 - Emergency (over 9 ppm)
## 88
## Carbon Monoxide - Code 4 - No Detector Activation
## 8
## Defective Oil Burner
## 34
## Downed Tree
## 280
## Elevator Emergency - Occupied
## 1850
## Elevator Emergency - Unoccupied
## 708
## Non-Medical 10-91 (Unnecessary Alarm)
## 102
## Odor - Other Smoke
## 166
## Odor - Other Than Smoke
## 1317
## Remove Civilian - Non-Fire
## 27
## Sprinkler System - Activated
## 6
## Sprinkler System - Malfunction
## 41
## Sprinkler System - Working on System
## 28
## Transit System Emergency
## 18
## Undefined Emergency
## 71
## Utility Emergency - Electric
## 595
## Utility Emergency - Gas
## 1335
## Utility Emergency - Steam
## 137
## Utility Emergency - Undefined
## 4
## Utility Emergency - Water
## 1157
## Vehicle Accident - Other
## 1443
## Vehicle Accident - With Extrication
## 21
##
## M_MFAs
##
## Medical MFA - EMS Link Medical MFA - PD Link
## 87 77
##
## NonM_MFAs
##
## Non-Medical MFA - ERS Non-Medical MFA - ERS No Contact
## 586 1
## Non-Medical MFA - Phone Non-Medical MFA - Private Fire Alarm
## 701 223
## Non-Medical MFA - Verbal
## 7
##
## S_F
##
## Church Fire
## 10
## Factory Fire
## 1
## Hospital Fire
## 18
## Multiple Dwelling 'A' - Compactor fire
## 4
## Multiple Dwelling 'A' - Food on the stove fire
## 519
## Multiple Dwelling 'A' - Other fire
## 168
## Multiple Dwelling 'B' Fire
## 85
## Other Commercial Building Fire
## 184
## Other Public Building Fire
## 4
## Private Dwelling Fire
## 412
## School Fire
## 31
## Store Fire
## 9
## Transit System - Structural
## 1
## Under Contruction / Vacant Fire
## 1
##
## NonS_F
##
## Abandoned Derelict Vehicle Fire Automobile Fire
## 6 101
## Brush Fire Demolition Debris or Rubbish Fire
## 24 272
## Manhole Fire - Blown Cover Manhole Fire - Other
## 9 55
## Manhole Fire - Seeping Smoke Other Transportation Fire
## 104 14
## Transit System - NonStructural
## 59
At this point it is essential to deal with NA values, trying to find the presence of possible relation with predictors. First things first let’s recap the number of NA values for each column that we have at the moment.
colSums(is.na(fire_data))
## id al_number al_location
## 0 0 0
## inc_borough zipcode pol_prec
## 0 2197 2197
## city_con_dist commu_dist commu_sc_dist
## 2197 2197 2198
## cong_dist al_source_desc al_index_desc
## 2197 0 0
## highest_al_level inc_class inc_class_group
## 0 0 0
## disp_resp_min_qy first_ass_datetime first_act_datetime
## 0 0 41
## first_onscene_datetime inc_close_datetime disp_resp_min_indc
## 0 0 0
## inc_resp_min_indc inc_resp_min_qy inc_travel_min_qy
## 0 0 0
## engines_assigned ladders_assigned others_units_assigned
## 4 4 4
## emergency_min_qy day_type working_hour
## 0 0 0
## time_of_day total_assigned_unit tua_is_one
## 0 4 4
Here we will check if there is a pattern on the absence of values in the following predictors: zipcode, pol_prec, city_con_dist, commu_dist, commu_sc_dist and cong_dist.
na_locations <- fire_data %>%
filter(is.na(zipcode) | is.na(pol_prec) | is.na(city_con_dist) | is.na(commu_dist) | is.na(commu_sc_dist) | is.na(cong_dist))
ggplot(data=na_locations %>%
group_by(inc_class_group, inc_borough) %>%
summarise(incident_number = n()),
aes(x=inc_borough, y=incident_number, fill=inc_class_group)) +
geom_bar(stat="identity", position=position_dodge()) +
geom_text(aes(label=incident_number), vjust=-0.5, color="black",
position = position_dodge(0.9), size=3.5) +
scale_fill_brewer(palette="Set1") +
labs(title = "NA Locations Columns", x = "Borough",
y = "Incident Count", fill = "Incident Class Group")
## `summarise()` has grouped output by 'inc_class_group'. You can override using
## the `.groups` argument.
By the Bar Chart we note that the majority of observations that have at least one of the location predictors to NA are of the incident class group NonMedical Emergency, Medical Emergencies and Non Medical MFAs.
Here we verify if the proportion of NA is equally distributed around each borough.
table(na_locations$inc_borough) / table(fire_data$inc_borough)
##
## Bronx Brooklyn Manhattan Queens Staten Island
## 0.07104538 0.04694547 0.07662157 0.07700328 0.07523697
And we found that Brooklyn has a lower percentage of NA location observations with respect to the others that are constant between them.
Thus we compute the same process but this time looking into the incident class group, like before we would like to have a constant distribution among all classes.
table(na_locations$inc_class_group) / table(fire_data$inc_class_group)
##
## M_E NonM_E M_MFAs NonM_MFAs S_F NonS_F
## 0.03988726 0.05691243 0.10975610 0.40447958 0.00552868 0.14906832
However, by looking at the proportion with the original dataset around 40% of the whole observations of type NonMedical MFAs have at least one of the location columns to NA. Let’s investigate.
In the original dataset this is the distribution of incident class for the NonMedical MFA incident class group.
fd_nm_mfa_cl <- table(subset(fire_data,
inc_class_group == "NonM_MFAs")$inc_class)
fd_nm_mfa_bro <- table(subset(fire_data,
inc_class_group == "NonM_MFAs")$inc_borough)
fd_nm_mfa_cl <- fd_nm_mfa_cl[fd_nm_mfa_cl != 0]
fd_nm_mfa_cl
##
## Non-Medical MFA - ERS Non-Medical MFA - ERS No Contact
## 586 1
## Non-Medical MFA - Phone Non-Medical MFA - Private Fire Alarm
## 701 223
## Non-Medical MFA - Verbal
## 7
And this is the distribution of incident class for the NonMedical MFA incident class group, but this time w.r.t. the subset of incidents that have at least one location column to NA.
na_nm_mfa_cl <- table(subset(na_locations,
inc_class_group == "NonM_MFAs")$inc_class)
na_nm_mfa_bro <- table(subset(na_locations,
inc_class_group == "NonM_MFAs")$inc_borough)
na_nm_mfa_cl <- na_nm_mfa_cl[names(fd_nm_mfa_cl)]
na_nm_mfa_cl
##
## Non-Medical MFA - ERS Non-Medical MFA - ERS No Contact
## 573 1
## Non-Medical MFA - Phone Non-Medical MFA - Private Fire Alarm
## 38 2
## Non-Medical MFA - Verbal
## 0
And we can clearly see that the observations of category Non-Medical MFA - ERS are the main.
na_nm_mfa_cl / fd_nm_mfa_cl
##
## Non-Medical MFA - ERS Non-Medical MFA - ERS No Contact
## 0.97781570 1.00000000
## Non-Medical MFA - Phone Non-Medical MFA - Private Fire Alarm
## 0.05420827 0.00896861
## Non-Medical MFA - Verbal
## 0.00000000
And more interestingly the 97% of all the Non-Medical MFA - ERS observations in the entire dataset have one of the location predictors equal to NA.
na_nm_mfa_bro / fd_nm_mfa_bro
##
## Bronx Brooklyn Manhattan Queens Staten Island
## 0.4676056 0.3075221 0.3894472 0.3733333 0.7954545
And from here we can see that about 78% of the observations that are NonMedical - MFAs that have at least one district column attribute to NA are from Staten Island. Also Bronx has about half of the NonMedical - MFAs observations that have at least one district column to NA.
ggplot(data=na_locations %>%
filter(inc_class == "Non-Medical MFA - ERS") %>%
group_by(inc_borough) %>%
summarise(incident_count = n()),
aes(x=inc_borough, y=incident_count)) +
geom_bar(stat="identity", position=position_dodge()) +
geom_text(aes(label=incident_count), vjust=1.6, color="white",
position = position_dodge(0.9), size=3.5) +
labs(title = "Non-Medical MFA - ERS", x = "Borough", y = "Incident Count")
Finally we have a lower number of Non-Medical MFA - ERS for Queens and Staten Island.
So concluding NonMedical MFA stands for Non Medical - Medical First Aid, unfortunately we are not able to find the meaning of ERS, even if we think that could be possible be connected with something regarding the respiratory system like ERS: Emergency Respiratory System. And thus we think that these observations have NA location columns for the same reason as the previously discussed 10-91 Medical Emergency. However these are all suppositions, maybe by contacting the NYC Firefighters Department by mail should make things much more clear.
Again for sake of simplicity we have assumed that the observations that have at least one NA location are randomly spread among the dataset.
print(fire_data %>%
filter(is.na(engines_assigned) | is.na(ladders_assigned) | is.na(others_units_assigned)))
## id al_number al_location inc_borough
## 1 230905-Q4545-001-0683 4545 53 AVE & 69 ST Queens
## 2 230914-Q1014-001-0693 1014 CENTRAL AVE & VIRGINIA ST/LCFD Queens
## 3 230918-Q9643-002-0752 9643 JAMAICA AVE & 114 ST Queens
## 4 230919-M0684-002-0995 684 1 AVE & E 30 ST Manhattan
## zipcode pol_prec city_con_dist commu_dist commu_sc_dist cong_dist
## 1 11378 104 30 405 24 6
## 2 11691 101 31 414 27 5
## 3 11418 102 29 409 27 5
## 4 10016 13 4 106 2 12
## al_source_desc al_index_desc highest_al_level
## 1 EMS Initial Alarm First Alarm
## 2 EMS-911 DEFAULT RECORD First Alarm
## 3 PHONE Initial Alarm First Alarm
## 4 Others Initial Alarm First Alarm
## inc_class inc_class_group disp_resp_min_qy
## 1 Medical - No PT Contact EMS is Onscene M_E 0.2666667
## 2 Medical - PD Link 10-91 M_E 0.1000000
## 3 Assist Civilian - Non-Medical NonM_E 0.5833333
## 4 Vehicle Accident - Other NonM_E 0.7000000
## first_ass_datetime first_act_datetime first_onscene_datetime
## 1 2023-09-05 16:32:44 2023-09-05 16:33:34 2023-09-05 16:37:22
## 2 2023-09-14 22:57:58 2023-09-14 22:58:16 2023-09-14 23:02:24
## 3 2023-09-18 23:44:39 2023-09-18 23:45:11 2023-09-18 23:50:08
## 4 2023-09-19 16:35:57 2023-09-19 16:36:07 2023-09-19 16:46:50
## inc_close_datetime disp_resp_min_indc inc_resp_min_indc inc_resp_min_qy
## 1 2023-09-05 16:40:41 N Y 4.900000
## 2 2023-09-14 23:05:08 N Y 4.533333
## 3 2023-09-19 00:07:10 N Y 6.066667
## 4 2023-09-19 16:56:33 N Y 11.583333
## inc_travel_min_qy engines_assigned ladders_assigned others_units_assigned
## 1 4.633333 NA NA NA
## 2 4.433333 NA NA NA
## 3 5.483333 NA NA NA
## 4 10.883333 NA NA NA
## emergency_min_qy day_type working_hour time_of_day total_assigned_unit
## 1 3.316667 Weekday TRUE Afternoon NA
## 2 2.733333 Weekday FALSE Evening NA
## 3 17.033333 Weekday FALSE Evening NA
## 4 9.716667 Weekday TRUE Afternoon NA
## tua_is_one
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 <NA>
We can easily remove these observations, since they do not appear in any pattern and are only 4.
Now we look into First Act Datetime and see if there is any pattern with the other columns.
na_first_act_datetime <- fire_data %>% filter(is.na(first_act_datetime))
print(arrange(na_first_act_datetime %>%
group_by(inc_class, inc_borough) %>%
summarise(incident_count = n()), desc(incident_count)))
## `summarise()` has grouped output by 'inc_class'. You can override using the
## `.groups` argument.
## # A tibble: 27 × 3
## # Groups: inc_class [15]
## inc_class inc_borough incident_count
## <fct> <fct> <int>
## 1 Assist Civilian - Non-Medical Brooklyn 4
## 2 Medical - Breathing / Ill or Sick Brooklyn 4
## 3 Medical - Breathing / Ill or Sick Manhattan 4
## 4 Medical - No PT Contact EMS is Onscene Queens 3
## 5 Non-Medical MFA - ERS Brooklyn 3
## 6 Alarm System - Unnecessary Brooklyn 2
## 7 Assist Civilian - Non-Medical Bronx 1
## 8 Assist Civilian - Non-Medical Queens 1
## 9 Demolition Debris or Rubbish Fire Brooklyn 1
## 10 Downed Tree Manhattan 1
## # ℹ 17 more rows
ggplot(data=na_first_act_datetime %>%
group_by(inc_class_group, inc_borough) %>%
summarise(incident_count = n()),
aes(x=inc_borough, y=incident_count, fill=inc_class_group)) +
geom_bar(stat="identity", position=position_dodge()) +
geom_text(aes(label=incident_count), vjust=1.6, color="white",
position = position_dodge(0.9), size=3.5) +
labs(title = "NA First Act Date", x = "Borough", y = "Incident Count",
fill = "Incident Class Group")
## `summarise()` has grouped output by 'inc_class_group'. You can override using
## the `.groups` argument.
Seems to be random and thus there is no pattern that motivates the presence of NA values in first_act_datetime.
At this point we can omit the NA values.
fire_data_new <- na.omit(fire_data)
And the un-useful predictors including disp_resp_min_indc and inc_travel_min_qy since these two information are contained in inc_resp_min_qy.
fire_data_new <- fire_data_new %>%
select(-c(zipcode, pol_prec, city_con_dist, commu_dist, al_location,
commu_sc_dist, first_ass_datetime, first_act_datetime,
first_onscene_datetime, inc_close_datetime, inc_resp_min_indc,
disp_resp_min_indc, inc_travel_min_qy, disp_resp_min_qy,
id, al_number, inc_class))
We also decided to remove the incident class since it contains too many category values which are summarised in the incident class group.
print(dfSummary(fire_data_new,
plain.ascii = FALSE,
style = "multiline",
headings = FALSE,
graph.magnif = 0.8,
valid.col = FALSE),
method = 'render')
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | inc_borough [factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2 | cong_dist [factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 3 | al_source_desc [factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 4 | al_index_desc [factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 5 | highest_al_level [factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 6 | inc_class_group [factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 7 | inc_resp_min_qy [numeric] |
|
1147 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 8 | engines_assigned [integer] |
|
15 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 9 | ladders_assigned [integer] |
|
12 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10 | others_units_assigned [integer] |
|
22 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 11 | emergency_min_qy [numeric] |
|
4040 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 12 | day_type [factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 13 | working_hour [factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 14 | time_of_day [factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 15 | total_assigned_unit [integer] |
|
34 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 16 | tua_is_one [factor] |
|
|
0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-18
In this section we will have a look at additional data visualisation in order to better understand how the predictors behave. We divide this section in multiple subsections each focus on a relative subset of columns.
See the final columns that we have:
names(fire_data_new)
## [1] "inc_borough" "cong_dist" "al_source_desc"
## [4] "al_index_desc" "highest_al_level" "inc_class_group"
## [7] "inc_resp_min_qy" "engines_assigned" "ladders_assigned"
## [10] "others_units_assigned" "emergency_min_qy" "day_type"
## [13] "working_hour" "time_of_day" "total_assigned_unit"
## [16] "tua_is_one"
We divide the data visualisation in the following subsections:
In each section analysis we will use both responses: inc_resp_min_qy adn emergency_time_qy.
Here we study the inc_borough and cong_dist columns for both responses.
ggplot(fire_data_new,
aes(x = inc_borough, y = inc_resp_min_qy, color = inc_class_group)) +
geom_boxplot() +
labs(title = "Borough - Incident Minutes Response Time - Incident Class",
x = "Borough", y = "Incident Minutes Response Time", color = "Incident Class")
Here we can see that the response time is pretty the same around all boroughs for each incident class, indeed there is a slightly faster response with respect to the other class for the Structural Fires as expected. The highest number of outliers are from the Medical and NonMedical Emergencies.
ggplot(fire_data_new,
aes(x = inc_borough, y = emergency_min_qy, color = inc_class_group)) +
geom_boxplot() +
labs(title = "Borough - Emergency Minutes Time - Incident Class",
x = "Borough", y = "Emergency Minutes Time", color = "Incident Class")
Like before the Structural Fires require lots of time but only in some drastic cases identified by the outliers since on average they require even less time respect other categories.
bp1 <- ggplot(fire_data_new,
aes(x = cong_dist, y = inc_resp_min_qy)) +
geom_boxplot() +
labs(title = "Cong District - EmergencyMin Time",
x = "Cong District", y = "Emergency Min Time")
bp2 <- ggplot(fire_data_new,
aes(x = cong_dist, y = emergency_min_qy)) +
geom_boxplot() +
labs(title = "Cong District - Emergency Min Time",
x = "Cong District", y = "Emergency Minutes Time")
grid.arrange(bp1, bp2, ncol = 2)
Here we can’t see any particular relevent information since on average we have the same distribution of time for all congressional district.
Here we study the al_source_desc, al_index_desc and highest_al_level columns for both responses.
ggplot(fire_data_new,
aes(x = al_source_desc, y = inc_resp_min_qy, color = inc_class_group)) +
geom_boxplot() +
labs(title = "Alarm Source Description - Incident Minutes Response Time - Incident Class",
x = "Alarm Source Description", y = "Incident Minutes Response Time",
color = "Incident Class")
With this boxplot we can understand that there are many outliers for emergencies called via phone, in particular NonMedical Emergencies. Also in both EMS and EMS-911 there are many outliers for the Medical Emergencies. As expected then for emergencies called via EMS-911 we do not find both Fires emergencies and NonMedical MFAs.
ggplot(fire_data_new,
aes(x = al_source_desc, y = emergency_min_qy, color = inc_class_group)) +
geom_boxplot() +
labs(title = "Alarm Source Description - Emergency Minutes Time - Incident Class",
x = "Alarm Source Description", y = "Emergency Minutes Time",
color = "Incident Class")
Now the first thing that we see is the high amount of outliers for the emergencies called via phone that are structural fires, indeed on average the emergency time is higher for that alarm source category. Another interesting fact is for CLASS-3 the NonMedical MFAs have a big interquartile range, since we have little information for that particular clas
ggplot(fire_data_new,
aes(x = al_index_desc, y = inc_resp_min_qy, color = inc_class_group)) +
geom_boxplot() +
labs(title = "Alarm Index Description - Incident Minutes Response Time - Incident Class",
x = "Alarm Index Description", y = "Incident Minutes Response Time",
color = "Incident Class")
Again we can see many outliers for the two major classes, and for the Initial Alarm we do not have any Medical and NonMedical MFAs emergencies. For the Others class like in the previous cans the interquartile range for the NonStructural Fires is big since in the merged category we do not have many observations.
ggplot(fire_data_new,
aes(x = al_index_desc, y = emergency_min_qy, color = inc_class_group)) +
geom_boxplot() +
labs(title = "Alarm Index Description - Emergency Minutes Time - Incident Class",
x = "Alarm Index Description", y = "Emergency Minutes Time",
color = "Incident Class")
Interestingly here we can see that around all the emergencies with index level DEFAULT RECORD have very thin interquartile range with also low outliers, with respect to the Initial Alarm and Others category level.
ggplot(fire_data_new,
aes(x = highest_al_level, y = inc_resp_min_qy, color = inc_class_group)) +
geom_boxplot() +
labs(title = "Highest Alarm Level - Incident Minutes Response Time - Incident Class",
x = "Highest Alarm Level", y = "Incident Minutes Response Time",
color = "Incident Class")
Pretty all emergency of First Alarm have the same amount of response time with particular outliers for Medical and NonMedical Emergencies and also for NonMedical MFAs, whereas for NonFirst Alarm since we have few observation there for the NonStructural Fire we have a big range of interquartile.
ggplot(fire_data_new,
aes(x = highest_al_level, y = emergency_min_qy,
color = inc_class_group)) +
geom_boxplot() +
labs(title = "Highest Alarm Level - Emergency Minutes Time - Incident Class",
x = "Highest Alarm Level", y = "Emergency Minutes Time",
color = "Incident Class")
Similarly to the previous subsection here all the emergencies with highest alarm level to First Alarm have a low emergency time in particular Medical and NonMedical MFAs, again we have a significant number of outliers for this class. Regarding NonFirst Level we have few observations and that is why the emergency time is so different with respect to the other class. Even if the longest time taken emergencies are a structural fire with the highest alarm level NonFirst Alarm.
Here we study the total_assigned_unit and tua_is_one columns for both responses.
ggplot(fire_data_new,
aes(x = total_assigned_unit, y = inc_resp_min_qy,
color = inc_class_group)) +
geom_point(aes(color = inc_class_group)) +
labs(title = "Total Units Assigned - Incident Minutes Response Time - Incident Class",
x = "Total Units Assigned", y = "Incident Minutes Response Time",
color = "Incident Class")
Whereas here as described in a previous similar chart, the total assigned units and the incident response time in minutes appear to be inversely proportional, with the contradistinction of Structural Fires that have many assigned units, and low response time and NonMedical Emergencies that have low assigned units and high response time.
ggplot(fire_data_new,
aes(x = total_assigned_unit, y = emergency_min_qy,
color = inc_class_group)) +
geom_point(aes(color = inc_class_group)) +
labs(title = "Total Units Assigned - Emergency Minutes Time - Incident Class",
x = "Total Units Assigned", y = "Emergency Minutes Time",
color = "Incident Class")
Here we can see that the Structural Fire incident has lots of variance directly and seems to be a directly proportional relationship between the total assigned units and the emergency time. For the NonMedical Emergencies they are clustered and then for the Medical Emergencies we can see that they have been mostly assigned a single unit.
ggplot(fire_data_new,
aes(x = tua_is_one, y = inc_resp_min_qy, color = inc_class_group)) +
geom_boxplot() +
labs(title = "Total Assigned Unit is 1 - Incident Minutes Response Time - Incident Class",
x = "Total Assigned Unit is 1", y = "Incident Minutes Response Time",
color = "Incident Class")
Here we can see a quite different respect to having a total assigned unit equal to one and not, indeed every incident class has a higher mean value for the incident which had been assigned a single unit.
ggplot(fire_data_new,
aes(x = tua_is_one, y = emergency_min_qy, color = inc_class_group)) +
geom_boxplot() +
labs(title = "Total Assigned Unit is 1 - Emergency Minutes Time - Incident Class",
x = "Total Assigned Unit is 1", y = "Emergency Minutes Time",
color = "Incident Class")
And in this case we can see that there is less variation on the emergency time response for the incident that had been assigned a single unit, with respect to the incident with more units assigned. And as expected the Structural Fires with more than one unit are the type of incident with most outliers given their severity.
Here we study the relation of inc_resp_min_qy against emergency_time_qy.
ggplot(fire_data_new,
aes(x = inc_resp_min_qy, y = emergency_min_qy, color = inc_class_group)) +
geom_point(aes(color = inc_class_group)) +
labs(title = "Incident Minutes Response Time - Emergency Minutes Time - Incident Class",
x = "Incident Minutes Response Time", y = "Emergency Minutes Time",
color = "Incident Class")
Here we can see many different distributions with the relation that the less the response time is the higher is the time taken to close the emergency.
Since we can’t understand much by this chart we decide to plot the density of each incident class group for both responses.
ggplot(fire_data_new, aes(x = inc_resp_min_qy, fill = inc_class_group)) +
geom_density(alpha = 0.3) + xlim(0, 20) +
labs(title = " Density Incident Minutes Response Time - Incident Class",
x = "Incident Minutes Response Time", y = "Density",
fill = "Incident Class")
## Warning: Removed 129 rows containing non-finite values (`stat_density()`).
In this density chart we can see that the distribution of Incident Response Time for each incident class seems to behave like a Log-Norm distribution. We limit the x axis in order to be able to understand the distribution otherwise it will result in a very thin curve so not much easy to understand.
ggplot(fire_data_new, aes(x = emergency_min_qy, fill = inc_class_group)) +
geom_density(alpha = 0.5) + xlim(0, 75) +
scale_fill_brewer(palette="Set1") +
labs(title = "Density Emergency Response Time - Incident Class",
x = "Emergency Response Time", y = "Density",
fill = "Incident Class")
## Warning: Removed 482 rows containing non-finite values (`stat_density()`).
Similarly as before also here the distribution of Incident Time for each incident class seems to behave like a Log-Norm distribution. Again like before we limit the x axis in order to be able to understand the distribution otherwise it will result in a very thin curve so not much easy to understand.
Here we study the day_type, working_hour and time_of_day columns for both responses.
ggplot(fire_data_new,
aes(x = day_type, y = inc_resp_min_qy, color = inc_class_group)) +
geom_boxplot() +
labs(title = "Day Type - Incident Minutes Response Time - Incident Class",
x = "Day Type", y = "Incident Minutes Response Time",
color = "Incident Class")
Speaking about the day we can’t see any relevant pattern here except that it appears on the weekend we have a lower number of outliers for all the incident classes.
ggplot(fire_data_new,
aes(x = day_type, y = emergency_min_qy, color = inc_class_group)) +
geom_boxplot() +
labs(title = "Day Type - Emergency Minutes Time - Incident Class",
x = "Day Type", y = "Emergency Minutes Time", color = "Incident Class")
Here there are any kind of patterns that we can recondunct to the day type for the emergency time taken.
ggplot(fire_data_new,
aes(x = working_hour, y = inc_resp_min_qy, color = inc_class_group)) +
geom_boxplot() +
labs(title = "Working Hour - Incident Minutes Response Time - Incident Class",
x = "Working Hour", y = "Incident Minutes Response Time",
color = "Incident Class")
Even for the working hour we can’t see any pattern regarding the incident response time.
ggplot(fire_data_new,
aes(x = working_hour, y = emergency_min_qy, color = inc_class_group)) +
geom_boxplot() +
labs(title = "Working Hour - Emergency Minutes Time - Incident Class",
x = "Working Hour", y = "Incident Minutes Time",
color = "Incident Class")
And even for the emergency time taken.
ggplot(fire_data_new,
aes(x = time_of_day, y = inc_resp_min_qy, color = inc_class_group)) +
geom_boxplot() +
labs(title = "Day Time - Incident Minutes Response Time - Incident Class",
x = "Day Time", y = "Incident Minutes Response Time",
color = "Incident Class")
Here we can’t see any relevant pattern except the fact that we have a higher number of outliers in the afternoon for the Medical, NonMedical Emergencies and Structural Fires.
ggplot(fire_data_new,
aes(x = time_of_day, y = emergency_min_qy, color = inc_class_group)) +
geom_boxplot() +
labs(title = "Day Time - Emergency Minutes Time - Incident Class",
x = "Day Time", y = "Incident Minutes Time", color = "Incident Class")
Here interestingly we can see that the longest time taken for emergencies are Structural Fires and they happened in the Night and in the Morning. Except this we do not have any pattern.
In this section we plot additional data visualization focused on the geographical visualization of the New York borough and congressional district with relative predictors. In order to do so we load an additional datasets:
firefighter_stations <- read.csv("datasets/fdny-firehouse-listing.csv")
head(firefighter_stations)
## FacilityName FacilityAddress
## 1 Engine 4/Ladder 15 42 South Street
## 2 Engine 10/Ladder 10 124 Liberty Street
## 3 Engine 6 49 Beekman Street
## 4 Engine 7/Ladder 1/Battalion 1/Manhattan Borough Command 100-104 Duane Street
## 5 Ladder 8 14 North Moore Street
## 6 Engine 9/Ladder 6 75 Canal Street
## Borough Postcode Latitude Longitude Community.Board Community.Council
## 1 Manhattan 10005 40.70347 -74.00754 1 1
## 2 Manhattan 10006 40.71007 -74.01252 1 1
## 3 Manhattan 10038 40.71005 -74.00525 1 1
## 4 Manhattan 10007 40.71546 -74.00594 1 1
## 5 Manhattan 10013 40.71976 -74.00668 1 1
## 6 Manhattan 10002 40.71521 -73.99290 3 1
## Census.Tract BIN BBL
## 1 7 1000867 1000350001
## 2 13 1075700 1000520022
## 3 1501 1001287 1000930030
## 4 33 1001647 1001500025
## 5 33 1002150 1001890035
## 6 16 1003898 1003000030
## NTA
## 1 Battery Park City-Lower Manhattan
## 2 Battery Park City-Lower Manhattan
## 3 Battery Park City-Lower Manhattan
## 4 SoHo-TriBeCa-Civic Center-Little Italy
## 5 SoHo-TriBeCa-Civic Center-Little Italy
## 6 Chinatown
summary(firefighter_stations)
## FacilityName FacilityAddress Borough Postcode
## Length:218 Length:218 Length:218 Min. :10001
## Class :character Class :character Class :character 1st Qu.:10304
## Mode :character Mode :character Mode :character Median :11103
## Mean :10784
## 3rd Qu.:11231
## Max. :11695
## NA's :5
## Latitude Longitude Community.Board Community.Council
## Min. :40.51 Min. :-74.24 Min. : 1.000 Min. : 1.00
## 1st Qu.:40.66 1st Qu.:-73.99 1st Qu.: 3.000 1st Qu.:12.00
## Median :40.72 Median :-73.94 Median : 6.000 Median :27.00
## Mean :40.72 Mean :-73.94 Mean : 7.075 Mean :25.63
## 3rd Qu.:40.77 3rd Qu.:-73.89 3rd Qu.:11.000 3rd Qu.:38.00
## Max. :40.89 Max. :-73.72 Max. :84.000 Max. :51.00
## NA's :5 NA's :5 NA's :5 NA's :5
## Census.Tract BIN BBL NTA
## Min. : 1 Min. :1000867 Min. :1.000e+09 Length:218
## 1st Qu.: 129 1st Qu.:2003268 1st Qu.:2.025e+09 Class :character
## Median : 275 Median :3064786 Median :3.025e+09 Mode :character
## Mean : 5950 Mean :2900421 Mean :2.850e+09
## 3rd Qu.: 800 3rd Qu.:4090228 3rd Qu.:4.033e+09
## Max. :157902 Max. :5154879 Max. :5.080e+09
## NA's :5 NA's :5 NA's :5
We now start with the firefighter stations dataset. By first making a copy of the fire_data_new and setting the borough from the firefighter_stations dataset to factor in order to be easily merged with the copied fire_data dataset.
# make a copy of the fire_data
fire_data_for_ffs <- fire_data_new
fire_data_for_ffs <- fire_data_for_ffs %>% rename(borough = inc_borough)
firefighter_stations$Borough <- as.factor(firefighter_stations$Borough)
firefighter_stations <- firefighter_stations %>% rename(borough = Borough)
# remove the na values from firefighter_stations
firefighter_stations <- na.omit(firefighter_stations)
Now we want to get the number of firefighter stations for each borough.
stations_borough <- firefighter_stations %>%
group_by(borough) %>%
summarise(number_of_stations = n())
Now we want to get a summary of the incident count, the number of stations and the incident per station of each borough in order to have a general view of the New York City situation.
This step is done twice. We have to group both for borough and then for congressional district.
count_inc_brough <- fire_data_for_ffs %>%
group_by(borough) %>%
summarise(incident_count = n())
count_inc_cdist <- fire_data_for_ffs %>%
group_by(cong_dist) %>%
summarise(incident_count = n())
stations_borough$incident_per_station <- round(
count_inc_brough$incident_count / stations_borough$number_of_stations,
digits = 3)
count_inc_brough <- merge(count_inc_brough, stations_borough, by="borough")
count_inc_brough
## borough incident_count number_of_stations incident_per_station
## 1 Bronx 6406 34 188.412
## 2 Brooklyn 9280 64 145.000
## 3 Manhattan 7294 47 155.191
## 4 Queens 6188 48 128.917
## 5 Staten Island 1560 20 78.000
Now we convert the firefighter_station data frame into a Spartial Data Frame to contain the geometry points.
firefighter_stations_sdf <- st_as_sf(firefighter_stations,
coords = c("Longitude", "Latitude"),
crs = 4326)
head(firefighter_stations_sdf)
## Simple feature collection with 6 features and 10 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -74.01252 ymin: 40.70347 xmax: -73.9929 ymax: 40.71976
## Geodetic CRS: WGS 84
## FacilityName FacilityAddress
## 1 Engine 4/Ladder 15 42 South Street
## 2 Engine 10/Ladder 10 124 Liberty Street
## 3 Engine 6 49 Beekman Street
## 4 Engine 7/Ladder 1/Battalion 1/Manhattan Borough Command 100-104 Duane Street
## 5 Ladder 8 14 North Moore Street
## 6 Engine 9/Ladder 6 75 Canal Street
## borough Postcode Community.Board Community.Council Census.Tract BIN
## 1 Manhattan 10005 1 1 7 1000867
## 2 Manhattan 10006 1 1 13 1075700
## 3 Manhattan 10038 1 1 1501 1001287
## 4 Manhattan 10007 1 1 33 1001647
## 5 Manhattan 10013 1 1 33 1002150
## 6 Manhattan 10002 3 1 16 1003898
## BBL
## 1 1000350001
## 2 1000520022
## 3 1000930030
## 4 1001500025
## 5 1001890035
## 6 1003000030
## NTA
## 1 Battery Park City-Lower Manhattan
## 2 Battery Park City-Lower Manhattan
## 3 Battery Park City-Lower Manhattan
## 4 SoHo-TriBeCa-Civic Center-Little Italy
## 5 SoHo-TriBeCa-Civic Center-Little Italy
## 6 Chinatown
## geometry
## 1 POINT (-74.00754 40.70347)
## 2 POINT (-74.01252 40.71007)
## 3 POINT (-74.00525 40.71005)
## 4 POINT (-74.00594 40.71546)
## 5 POINT (-74.00668 40.71976)
## 6 POINT (-73.9929 40.71521)
At this point we download the .geojson files that contain all the geometry of each borough in order to have a cool map visualization of NYC.
Here we load the NYC_BoroughBoundaries.geojson.
geoj_nyc_borough <- geojson_read("datasets/NYC_BoroughBoundaries.geojson",
what = "sp")
geoj_nyc_borough <- setNames(geoj_nyc_borough,
c("borough_code", "borough", "shape_area", "shape_leng"))
geoj_nyc_borough$borough <- as.factor(geoj_nyc_borough$borough)
geoj_nyc_borough$borough_code <- NULL
head(geoj_nyc_borough)
## borough shape_area shape_leng
## 1 Staten Island 1623620725.05 325917.35395
## 2 Manhattan 636520502.758 357713.308162
## 3 Bronx 1187174772.5 463180.579449
## 4 Brooklyn 1934138215.76 728146.574928
## 5 Queens 3041418506.64 888199.731385
And here the Congressiona_Districts.geojson.
geoj_nyc_cdist <- geojson_read("datasets/Congressional_Districts.geojson",
what = "sp")
geoj_nyc_cdist$cong_dist <- as.factor(geoj_nyc_cdist$cong_dist)
head(geoj_nyc_cdist)
## cong_dist shape_area shape_leng
## 1 16 53350456.3831 38220.195314
## 2 11 1810287145.51 411602.819728
## 3 5 1313441026.63 647233.193924
## 4 6 723356617.906 198557.173232
## 5 8 736292261.799 575320.090507
## 6 9 422693976.037 115321.546362
And now we merge geoj_nyc_borough with count_inc_brough maintaining the Spartial Data Frame type.
geoj_nyc_borough@data = data.frame(geoj_nyc_borough@data,
count_inc_brough[match(geoj_nyc_borough@data$borough,
count_inc_brough$borough),])
geoj_nyc_borough@data$borough.1 <- NULL # remove the added column
Here we are making the same step as before but this with the geoj_nyc_cdist merging via cong_dist, again maintaining the Spartial Data Frame type.
geoj_nyc_cdist@data = data.frame(geoj_nyc_cdist@data, count_inc_cdist[match(geoj_nyc_cdist@data$cong_dist, count_inc_cdist$cong_dist),])
geoj_nyc_cdist@data$cong_dist.1 <- NULL # remove the added column
And finally we can plot the interactive maps using the mapview function.
First for Incident Borough:
mapview(list(firefighter_stations_sdf, geoj_nyc_borough),
zcol = list(NULL, "incident_count"),
legend = list(FALSE, TRUE),
homebutton = list(FALSE, TRUE),
layer.name = list(NULL, "Incidents Count"),
popup = list(popupTable(firefighter_stations_sdf),
popupTable(geoj_nyc_borough)),
alpha.regions = 0.5, aplha = 1)
And then for Congressional District:
mapview(geoj_nyc_cdist,
zcol = "incident_count",
legend = TRUE,
popup = popupTable(geoj_nyc_cdist),
homebutton = TRUE,
layer.name = "Incidents Count",
alpha.regions = 0.5, aplha = 1)
As suggested by the professor we have opted to linear regression using as response:
1- inc_resp_min_qy 2- emergency_min_qy
Initially we were thinking to solve a multi-classification / binary classification problem for the inc_class_group, however we were considering all the time difference predictors that are a future information w.r.t. the inc_class_group in prediction time, so it doesn’t make much sense to use them, and it is possible also that they will result in super predictors. That’s why we decided to grab the professor’s suggestion.
For both analyses we transform the relative response in log scale in order to simulate the behaviour of the Log-Norm distribution. Of course we have to verify that the linearity assumptions are met.
So first things first let’s check if there are some observations that have at least one of the two possible responses equal to zero, if so we have to remove them in order to apply next the log transformation.
summary(fire_data_new %>% select(inc_resp_min_qy, emergency_min_qy))
## inc_resp_min_qy emergency_min_qy
## Min. : 0.350 Min. : 0.00
## 1st Qu.: 4.383 1st Qu.: 7.15
## Median : 5.467 Median : 11.93
## Mean : 5.949 Mean : 17.08
## 3rd Qu.: 6.850 3rd Qu.: 19.58
## Max. :58.917 Max. :596.38
Remove them.
fire_data_new <- fire_data_new %>% filter(emergency_min_qy != 0)
Then we have to check the presence of correlation in the continuous predictors and if so deleting one or more of them. Thus here we are doing an initial chec of multicollinearity problems only in the numerical variable before once we create our for the future models. In order to do so, we compute the square of the correlation matrix.
round(cor(fire_data_new %>% dplyr::select(where(is.numeric)))^2, digits=3)
## inc_resp_min_qy engines_assigned ladders_assigned
## inc_resp_min_qy 1.000 0.064 0.015
## engines_assigned 0.064 1.000 0.317
## ladders_assigned 0.015 0.317 1.000
## others_units_assigned 0.023 0.294 0.325
## emergency_min_qy 0.000 0.046 0.017
## total_assigned_unit 0.045 0.719 0.693
## others_units_assigned emergency_min_qy
## inc_resp_min_qy 0.023 0.000
## engines_assigned 0.294 0.046
## ladders_assigned 0.325 0.017
## others_units_assigned 1.000 0.121
## emergency_min_qy 0.121 1.000
## total_assigned_unit 0.704 0.077
## total_assigned_unit
## inc_resp_min_qy 0.045
## engines_assigned 0.719
## ladders_assigned 0.693
## others_units_assigned 0.704
## emergency_min_qy 0.077
## total_assigned_unit 1.000
As we can see, total_assigned_unit is heavily correlated to the other counts since it is the sum of those, thus deciding to remove the sum of units. Continuing we note also that lot’s of time differences are correlated to each other, whoever it is obvious since some of them include other smaller differences, these measures will be managed soon once we deal with the two types of analysis.
fire_data_new <- fire_data_new %>% select(-c(total_assigned_unit))
Next, before creating any model, we have to split the cleaned dataset into train and test, with 80% of the whole dataset for the train set and the remaining 20% for the test set. We want to mention that it would be better to perform cross validation using the entire data but for sake of simplicity and time we decide to opt for the classic train-test split.
set.seed(43)
split <- sample.split(fire_data_new, SplitRatio = 0.8)
# Create training and testing sets
fire_data.train <- subset(fire_data_new, split == TRUE)
fire_data.test <- subset(fire_data_new, split == FALSE)
rownames(fire_data.train) <- NULL
rownames(fire_data.test) <- NULL
dim(fire_data.train)
## [1] 24580 15
dim(fire_data.test)
## [1] 6146 15
Save the cleaned train and test datasets for the next parts of the analysis only if they are not already stored on the machine.
if (!file.exists("datasets/fire_data_clean.train.csv")){
write.csv(fire_data.train, "datasets/fire_data_clean.train.csv", row.names=FALSE)
}
if (!file.exists("datasets/fire_data_clean.test.csv")) {
write.csv(fire_data.test, "datasets/fire_data_clean.test.csv", row.names=FALSE)
}
Please go to the Part 2 - Linear Regression to continue the analysis.